Structure-based chemical ontology improves chemometric prediction of antibacterial essential oils

Plants are valuable resources for drug discovery as they produce diverse bioactive compounds. However, this chemical diversity makes it difficult to predict the biological activity of plant extracts via conventional chemometric methods. In this research, we propose a new computational model that integrates chemical composition data with structure-based chemical ontology. For model validation, two training datasets were prepared from literature on antibacterial essential oils to classify active/inactive oils. Random forest classifiers constructed from the data showed improved prediction performance in both test datasets. Prior feature selection using hierarchical information criterion further improved the performance. Furthermore, an antibacterial assay using a standard strain of Staphylococcus aureus revealed that the classifier correctly predicted the activity of commercially available oils with an accuracy of 83% (= 10/12). The results of this study indicate that machine learning of chemical composition data integrated with chemical ontology can be a highly efficient approach for exploring bioactive plant extracts.

www.nature.com/scientificreports/
Chemical ontologies provide standardised, hierarchical chemical classes for chemical compounds. In particular, the recent development of structure-based chemical ontology, which provides structured classifications of chemical entities into hierarchically arranged chemical classes, has made it possible to automate the annotation of numerous compounds 15,16. However, to the best of the authors' knowledge, these ontologies have not yet been used in composition-activity relationship (CAR) studies.
In this study, we show that structure-based chemical ontology has the potential to improve the active/inactive classification of antibacterial essential oils (EOs). An overview of the study is shown in Fig. 1. The ratios of EO constituents that belong to each ontology class were summed to create a new compositional feature. Feature selection was optionally performed to exclude irrelevant and redundant features and to reduce the dimensionality. Either all features or the selected features were then used to train a machine learning algorithm to classify active/inactive EOs. We illustrate that this approach improved the classification performance on two test datasets from the literature and showed good performance in an antibacterial assay.

Data
A literature search on antibacterial EOs was performed using PubMed 17 and Google Scholar 18 in April 2021 and October 2023. The keywords "antimicrobial + AND + oil", "antibacterial + AND + oil", "bactericidal + AND + oil" and "microbicidal + AND + oil" were used for the search. The antibacterial test method was restricted to broth dilution, and EOs with a minimal inhibitory concentration (MIC) ≤ 1 mg/mL were interpreted as active and those with an MIC > 1 mg/mL as inactive. The tested organisms were restricted to Staphylococcus aureus and Escherichia coli, the two most commonly studied bacteria for exploring the antibacterial activity of plant extracts 19. EOs from the same plant species were removed to eliminate redundancy.
The chemical composition data of the antibacterial EOs were also retrieved through the literature search. Trace constituents with peak areas lower than 0.1% of the total ion chromatogram (TIC) were ignored. Chemical ontology (ChemOnt ver. 2.1) classes corresponding to the EO constituents were obtained from the ClassyFire web application 15 by inputting their chemical structures acquired from the PubChem database 20. The hierarchical structure of ChemOnt was also obtained from the ClassyFire web site.
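The two data-preparation rules above (the MIC threshold for labelling and the 0.1% cutoff for trace constituents) can be illustrated with a minimal Python sketch. The record layout, function names and the example oil are assumptions made for illustration only; they are not taken from the study's dataset or code.

```python
# Thresholds stated in the text.
MIC_THRESHOLD_MG_PER_ML = 1.0   # MIC <= 1 mg/mL is interpreted as active
TRACE_CUTOFF_PERCENT = 0.1      # peak areas below 0.1% of the TIC are ignored


def label_activity(mic_mg_per_ml):
    """Return 'active' if MIC <= 1 mg/mL, otherwise 'inactive'."""
    return "active" if mic_mg_per_ml <= MIC_THRESHOLD_MG_PER_ML else "inactive"


def drop_trace_constituents(composition):
    """Keep only constituents whose peak area is at least 0.1% of the TIC."""
    return {name: pct for name, pct in composition.items()
            if pct >= TRACE_CUTOFF_PERCENT}


# Hypothetical example oil (made-up values, not from the study).
oil = {
    "mic_mg_per_ml": 0.5,
    "composition": {"constituent_1": 45.2, "constituent_2": 20.1,
                    "constituent_3": 0.05},
}

label = label_activity(oil["mic_mg_per_ml"])
kept = drop_trace_constituents(oil["composition"])
print(label, sorted(kept))
```

In this sketch an MIC of exactly 1 mg/mL counts as active, which follows the "≤ 1 mg/mL" rule in the text.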

Reagents
Acetone for gas chromatography was purchased from KISHIDA CHEMICAL Co., Ltd, Japan. Dimethyl sulfoxide (DMSO) and thymol (special grade, purity 100.0%) were purchased from FUJIFILM Wako Pure Chemical Corporation, Japan. A series of n-alkane standards (C9-C40) was purchased from GL Sciences Inc., Tokyo, Japan. Mueller-Hinton II broth was purchased from Becton, Dickinson and Company, USA. S. aureus (NBRC 12732) for antibacterial activity tests was obtained from the National Institute of Technology and Evaluation, Biological Resource Center (NBRC), Japan.

Machine learning of chemical composition with chemical ontology
For each ontology class in ChemOnt, the ratios of EO constituents that belong to the class were summed to create a new compositional feature. The EO samples were divided into a training dataset and a test dataset according to whether the paper was published before or after December 2020. For the training dataset, the antibacterial labels and the features (chemical constituents with or without ChemOnt classes) were then learned by a random forest 21 to classify the active/inactive EOs against S. aureus and E. coli, respectively. The number of features used for feature subsampling (mtry) and the splitting rule (split-rule) of the random forest were tuned by tenfold cross-validation using the R 'caret' package (ver. 6.0-93). Then, the labels of the test dataset were predicted by the classifier, and the output probabilities of the active/inactive classification were evaluated by a receiver operating characteristic (ROC) curve 22. The classification performance was also evaluated by precision, recall and F1 score. The training and prediction steps were repeated 10 times, and a paired two-tailed t-test was used to determine whether there was any difference in the area under the ROC curve (AUC) between the methods. To measure the performance for EOs predicted to be active with high output probability, partial AUCs were calculated using the 'pROC' R package (ver. 1.18.0).
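The feature construction step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation (which used R). The toy hierarchy, the class names and the constituent names are hypothetical and do not reproduce ChemOnt. The sketch also assumes that a constituent counts toward its direct class and toward every ancestor of that class.

```python
from collections import defaultdict

# Hypothetical child -> parent links for a toy class hierarchy (not ChemOnt).
PARENT = {
    "Monoterpenoids": "Terpenoids",
    "Sesquiterpenoids": "Terpenoids",
    "Terpenoids": "Lipids",
}

# Hypothetical direct class assignment of each constituent.
DIRECT_CLASS = {
    "constituent_1": "Monoterpenoids",
    "constituent_2": "Monoterpenoids",
    "constituent_3": "Sesquiterpenoids",
}


def class_and_ancestors(cls, parent):
    """Return the class followed by all of its ancestors up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = parent.get(cls)
    return chain


def add_class_features(composition, direct_class, parent):
    """Sum constituent ratios per class and append them to the constituent features."""
    totals = defaultdict(float)
    for constituent, ratio in composition.items():
        cls = direct_class.get(constituent)
        if cls is None:
            continue  # constituents without a class contribute to no class feature
        for c in class_and_ancestors(cls, parent):
            totals[c] += ratio
    features = dict(composition)   # original constituent features
    features.update(totals)        # plus one summed feature per class
    return features


# Hypothetical composition (percent of total peak area).
composition = {"constituent_1": 40.0, "constituent_2": 25.0, "constituent_3": 10.0}
features = add_class_features(composition, DIRECT_CLASS, PARENT)
print(features)
```

With this layout, two oils that share no individual constituent can still share non-zero class-level features, which is the motivation given in the text for integrating the ontology.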

Feature selection
Feature selection was performed to identify a subset of features (chemical constituents and ChemOnt classes) that can optimally differentiate active and inactive EOs in the training dataset for S. aureus. In this study, the hierarchical information criterion (HIC) 23 was employed as a feature ranking method that exploits the structure of hierarchical features. The HIC was originally developed to rank features using the numbers of patients in two groups. To apply the HIC algorithm to our data, we modified it as follows:
(1) A mutual information estimator 24 was introduced to calculate mutual information between continuous (chemical composition) and discrete (activity label) data. The estimator uses a nearest-neighbor method to avoid problems with binning continuous data.
(2) Branch statistical significance (comparing each feature in a branch to every other feature in the same branch) and tree statistical significance (comparing each feature to every other feature in the hierarchy) were determined by a pairwise t-test instead of a 2-proportion z-test.
Although the exact weights of the branch and tree statistical significances can be calculated using the frequency of non-zero values, they were set to 0.5 for model simplification in this study. The modified algorithm is shown in Algorithm 1.
Algorithm 1. Hierarchical information criterion (HIC) ranking for continuous dataset.
An ablation study was conducted to investigate how each component of HIC (line 6 of Algorithm 1) influences the AUC. Mutual information and each of the other terms (hierarchical level, branch statistical significance and tree statistical significance) of HIC were used to rank the features, and tenfold cross-validation within the training dataset was performed using the top K features (where K is 2, 4, 8, 16, 32, 64, 128 or 256).
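The selection of the top K ranked features over the grid of K values can be sketched as below. The scores are hypothetical placeholders, not HIC values, and this sketch does not implement the HIC itself. The function names are assumptions for illustration.

```python
# Grid of K values stated in the text.
K_GRID = [2, 4, 8, 16, 32, 64, 128, 256]


def top_k_features(scores, k):
    """Return the k feature names with the highest scores (ties broken by name)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


def select_columns(samples, selected):
    """Restrict each sample (feature -> value dict) to the selected features."""
    return [[sample.get(f, 0.0) for f in selected] for sample in samples]


# Hypothetical ranking scores for four features.
scores = {"feature_a": 0.42, "feature_b": 0.10, "feature_c": 0.35, "feature_d": 0.27}
subsets = {k: top_k_features(scores, k) for k in K_GRID}

# Hypothetical samples reduced to the top two features.
samples = [{"feature_a": 12.0, "feature_c": 3.5}, {"feature_b": 7.0}]
matrix = select_columns(samples, subsets[2])
print(subsets[2], matrix)
```

Each subset would then be passed to the classifier and evaluated by cross-validation. When K exceeds the number of available features, the sketch simply returns all of them.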

To evaluate the effect of feature selection on the prediction performance, random forest classifiers were constructed from the top K features ranked by the HIC score. For comparison, principal component analysis (PCA) followed by training with the first K principal components (PCs) was performed using the chemical composition data (without ChemOnt classes) to determine whether a simple dimension reduction based on the variance affects the classifier's performance.
To discuss the compositional difference between the training and test datasets, the high-dimensional EO composition data were embedded into a two-dimensional map using t-distributed stochastic neighbor embedding (t-SNE) 25. The embedding was calculated using the 'tsne' R package (ver. 0.1-3.1).

Gas chromatography/mass spectrometry (GC/MS) analysis
EOs from Daucus carota var. sativus and Santalum album were purchased from suppliers in Japan. Chemical characterization was performed in the same manner as that reported by the authors 26 using a gas chromatograph coupled with a mass spectrometer, model QP2010 (Shimadzu, Kyoto, Japan). The EOs were dissolved in acetone (2 μL/mL). This solution (1 μL) was injected in split mode (1:50 ratio) onto a DB-5MS column (30 m × 0.25 mm i.d. × 0.25 μm film thickness, Agilent, USA). The injection temperature was set at 270 °C. The oven temperature was held at 60 °C for 1 min after injection, increased at 10 °C/min to 180 °C and held for 1 min, increased at 20 °C/min to 280 °C and held for 3 min, and then increased at 20 °C/min to 325 °C, where the column was held for 20 min. Mass spectra were obtained in the range of 20-550 m/z. EO components were identified based on a mass spectral library search (National Institute of Standards and Technology, NIST 14), the calculation of retention indices relative to a homologous series of n-alkanes, and a comparison of their mass spectra with mass spectral data from the literature 27,28.
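The text does not state which retention index formula was used. For temperature-programmed runs such as the one above, a common choice is the linear retention index, RI = 100 × [n + (t_x − t_n) / (t_(n+1) − t_n)], where t_n and t_(n+1) are the retention times of the n-alkanes with n and n + 1 carbons that bracket the analyte. The sketch below assumes this form and uses hypothetical retention times.

```python
from bisect import bisect_right


def linear_retention_index(t_x, alkane_times):
    """Linear retention index of a peak eluting at t_x (min).

    alkane_times maps carbon number -> retention time (min) and is assumed
    to contain consecutive carbon numbers with increasing retention times.
    """
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    i = bisect_right(times, t_x) - 1          # index of the alkane eluting just before t_x
    if i < 0 or i >= len(carbons) - 1:
        raise ValueError("retention time is not bracketed by two n-alkanes")
    n = carbons[i]
    t_n, t_next = times[i], times[i + 1]
    return 100.0 * (n + (t_x - t_n) / (t_next - t_n))


# Hypothetical n-alkane retention times (min); not measured values.
alkanes = {10: 6.0, 11: 8.0, 12: 10.0}
ri = linear_retention_index(9.0, alkanes)
print(ri)
```

A peak eluting midway between the C11 and C12 alkanes receives an index midway between 1100 and 1200 under this formula.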

Model evaluation by in vitro antibacterial assay
The antibacterial activity of the 12 commercially available EOs was predicted using the classifier for S. aureus with the best AUC and chemical composition data obtained by GC/MS analysis. The EOs were classified as active if the output probability was ≥ 0.5 and as inactive otherwise. The EOs from Daucus carota var. sativus and Santalum album were tested using the broth microdilution assay in the same manner as that reported by the authors 26. A stock solution of each EO (dissolved to a concentration of 40 mg/mL in DMSO) was diluted to 4 mg/mL with Mueller-Hinton II broth medium, followed by serial dilution with the medium to lower concentrations (2, 1, 0.5, 0.25, 0.125, 0.0625, 0.0313, 0.0156 and 0.0078 mg/mL). Thymol, a known antibacterial agent, was dissolved and diluted in the same way and used as a positive control to confirm microbial susceptibility. The oils were all tested in triplicate. S. aureus NBRC 12732 was inoculated onto normal agar plates and cultured for 24 h at 35 ± 1 °C. The bacterial suspensions were diluted with saline to obtain a 0.5 McFarland turbidity equivalent (ca. 10^8 colony forming units per mL (CFU/mL)), and further diluted 10 times with saline (ca. 10^7 CFU/mL). Then, 0.1 mL of EO-containing medium and 5 μL of inoculum were added to sterile microtiter plates. DMSO at 10% (v/v) in the medium was used as a negative control to determine whether the solvent exhibited any antibacterial effect. The microtiter plates were incubated for 18 to 24 h at 35 ± 1 °C. Based on the opacity and color change in each well, the lowest concentration capable of inhibiting growth was determined as the MIC.
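The arithmetic of the dilution scheme and the MIC readout can be checked with a short Python sketch. The growth pattern is hypothetical, and the helper names are assumptions. The MIC rule coded here (the lowest concentration at which it and all higher concentrations show no growth) is one reasonable reading of the definition in the text.

```python
STOCK_MG_PER_ML = 40.0   # EO stock in DMSO
TOP_MG_PER_ML = 4.0      # highest tested concentration in broth
N_TWOFOLD_STEPS = 9      # 2, 1, ..., 0.0078 mg/mL


def twofold_series(top, n_steps):
    """Return the top concentration followed by n_steps twofold dilutions."""
    return [top / (2 ** i) for i in range(n_steps + 1)]


def mic_from_growth(concentrations, growth):
    """Lowest concentration with no growth at it or at any higher concentration.

    Returns None if growth is seen even at the highest concentration.
    """
    mic = None
    for conc, grew in sorted(zip(concentrations, growth), reverse=True):
        if grew:
            break
        mic = conc
    return mic


series = twofold_series(TOP_MG_PER_ML, N_TWOFOLD_STEPS)

# Diluting the DMSO stock 10-fold gives about 10% (v/v) DMSO at 4 mg/mL,
# which matches the 10% (v/v) DMSO negative control described in the text.
dmso_fraction_at_top = TOP_MG_PER_ML / STOCK_MG_PER_ML

# Inoculum: 5 uL of ca. 10^7 CFU/mL added to 0.1 mL of medium.
cfu_per_well = 1e7 * 0.005
cfu_per_ml_in_well = cfu_per_well / (0.1 + 0.005)

# Hypothetical readout: growth appears at 0.5 mg/mL and below.
growth = [False, False, False, True, True, True, True, True, True, True]
mic = mic_from_growth(series, growth)
print(series, mic, round(cfu_per_ml_in_well))
```

The exact twofold values 0.03125, 0.015625 and 0.0078125 mg/mL correspond to the rounded concentrations 0.0313, 0.0156 and 0.0078 mg/mL listed in the text.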

Data collection
The literature search identified 562 (270 active and 292 inactive) EOs for S. aureus and 495 (173 active and 322 inactive) EOs for E. coli with chemical composition data (Supplementary Tables S1 and S2). In total, 1,329 chemical constituents belonging to 327 ChemOnt classes for S. aureus and 1,307 chemical constituents belonging to 336 ChemOnt classes for E. coli were reported to compose the EOs (Supplementary Tables S3, S4, S5 and S6). Among them, 413 (215 active and 198 inactive) EOs for S. aureus and 360 (134 active and 226 inactive) EOs for E. coli were published before December 2020 (training dataset), and the other 149 (55 active and 94 inactive) EOs for S. aureus and 135 (39 active and 96 inactive) EOs for E. coli were published after that month (test dataset).
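The publication-date split and the reported counts can be illustrated with a small Python sketch. The record fields and example records are hypothetical. The text does not say how papers dated in December 2020 itself were handled, so assigning them to the training dataset is an assumption of this sketch.

```python
CUTOFF = (2020, 12)  # (year, month) of the split stated in the text


def split_by_publication(records, cutoff=CUTOFF):
    """Split records into training and test sets by publication date.

    Assumption: papers dated up to and including the cutoff month go to training.
    """
    train = [r for r in records if (r["year"], r["month"]) <= cutoff]
    test = [r for r in records if (r["year"], r["month"]) > cutoff]
    return train, test


def counts_consistent(train, test, total):
    """Check that (active, inactive) training and test counts add up to the total."""
    return (train[0] + test[0], train[1] + test[1]) == total


# Hypothetical records (not from the study's dataset).
records = [
    {"oil": "oil_1", "year": 2015, "month": 6},
    {"oil": "oil_2", "year": 2020, "month": 11},
    {"oil": "oil_3", "year": 2021, "month": 3},
    {"oil": "oil_4", "year": 2023, "month": 9},
]
train, test = split_by_publication(records)

# (active, inactive) counts reported in the text.
s_aureus_ok = counts_consistent((215, 198), (55, 94), (270, 292))
e_coli_ok = counts_consistent((134, 226), (39, 96), (173, 322))
print([r["oil"] for r in train], [r["oil"] for r in test], s_aureus_ok, e_coli_ok)
```

The reported training and test counts add up to the reported totals for both organisms.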

Machine learning of chemical composition with chemical ontology
The binary classifier models for S. aureus and E. coli were successfully constructed from the training dataset using composition data of all chemical constituents with/without integration of ChemOnt classes.

Feature selection
The ablation study for HIC showed that hierarchical level and branch statistical significance partially improved the AUC, whereas tree statistical significance did not yield a better AUC (Supplementary Figure S1A). Therefore, we omitted the tree statistical significance in the following evaluation.

Model evaluation by in vitro antibacterial assay
The Myroxylon balsamum var. pereirae EO was predicted to be inactive, and the other 11 EOs were predicted to be active by the classifier constructed from the top 32 features. The antibacterial assay revealed that the MICs against S. aureus were 1 mg/mL for Santalum album EO and 2.5 mg/mL for Daucus carota var. sativus EO. The MICs of the other 10 EOs were already reported in a previous study 26, and are also summarized in Table 3. The MIC for thymol (positive control) was 0.25 mg/mL, which was equivalent to literature data (0.03% v/v 30). No inhibition of bacterial growth was observed in the negative control. In total, the classifier achieved an accuracy of 83% (= 10/12) and an AUC of 0.704 (Supplementary Figure S2) for the commercially available EOs. Among the EOs, Thujopsis dolabrata EO was correctly predicted to be active even though its main components (thujopsene 49.4%, cedrol 5.8%, β-bisabolene 4.2%) were distinct from those of any EO in the training dataset.
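The evaluation measures used in this study (classification at a probability threshold of 0.5, accuracy, precision, recall, F1 score and AUC) can be sketched in plain Python as follows. The labels and probabilities are hypothetical and are not the values for the 12 tested oils. The AUC is computed here as the fraction of active/inactive pairs in which the active sample receives the higher probability, with ties counted as half.

```python
def classify(probabilities, threshold=0.5):
    """Predict active (True) if the output probability is >= threshold."""
    return [p >= threshold for p in probabilities]


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 'active' as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observed labels (True = active) and output probabilities.
y_true = [True, True, True, False, False, True]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]

y_pred = classify(probs)
tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
auc = roc_auc(y_true, probs)
print(accuracy, precision, recall, f1, auc)

# The accuracy reported in the text corresponds to 10 correct out of 12 oils.
reported_accuracy = round(10 / 12, 2)
```

Accuracy depends on the 0.5 threshold, whereas the AUC depends only on how the oils are ranked by output probability, which is why the two measures can differ for the same predictions.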

Discussion
In this study, we collected the literature data of 522 antibacterial EOs composed of more than 1,300 compounds for machine learning classification. As expected from the high dimensionality of the input data, the conventional chemometric classifier failed to show equivalent performance on the test dataset. Prior dimension reduction using PCA provided only a small improvement in the performance, probably because of the high sparsity of the data and the low accumulated variance contribution of the PCs. This result indicates that the chemical diversity of antibacterial plant extracts is difficult to represent by low-dimensional vectors derived from chemical composition data alone.
The principle that compounds with similar structures (common structural features) possess similar biological activities is well accepted and has been applied to structure-activity relationship research in medicinal chemistry 31. Our approach incorporates this principle into the composition-activity relationship by creating a higher-level feature set reflecting the similarity in molecular structures. Despite the small number of EO samples, the strategy constructed a robust classifier without significant overfitting in this study. Although the imbalanced (173 active vs 322 inactive) training dataset for E. coli caused low recall (0.2 to 0.3), the classifier retained good performance in ranking EOs ordered by the output probability (AUC = 0.72). Cost-sensitive learning or sampling techniques may further improve the performance.
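As one illustration of cost-sensitive learning, class weights can be set inversely proportional to class frequency so that errors on the minority (active) class cost more. The sketch below uses the common "balanced" heuristic, weight = N / (number of classes × class count), with the E. coli counts quoted above. This is an assumption made for illustration and was not evaluated in this study.

```python
def balanced_class_weights(counts):
    """Return weights inversely proportional to class frequency.

    weight_c = N / (n_classes * n_c), so each class carries the same total weight.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# E. coli class counts quoted in the text.
counts = {"active": 173, "inactive": 322}
weights = balanced_class_weights(counts)
weighted_totals = {label: weights[label] * counts[label] for label in counts}
print(weights, weighted_totals)
```

Under these weights an active EO counts roughly 1.9 times as much as an inactive EO during training, and both classes contribute the same total weight.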
The 32 chemical classes ranked by HIC score (Table 2 and Fig. 3) provide insights into chemical structures that are desirable in an antibacterial agent. Phenols are a well-characterized class having antibacterial activity 32. Carvacrol and thymol are phenols frequently found in antibacterial EOs, and have been used in dental applications and food flavorings 33. Sesquiterpenoids, a structurally diverse class of C15 compounds composed of three isoprene units, were reported to show good antibacterial activity with MICs lower than 1 mg/mL 34. Fatty acids are also well-known antibacterial agents in the literature, and were reported to show synergistic effects with several EO constituents 35. In contrast, monoterpenoids (C10 compounds composed of two isoprene units) and carboxylic acid esters (included in the "carboxylic acid derivatives" class) were reported to have antibacterial activity with higher MICs (> 1 mg/mL) 36, which suggests that they were learned as patterns of inactive EOs. Another antibacterial assay using the disc diffusion method also showed that ketones and esters (acetate esters) were less potent than the corresponding alcohols among monoterpenes 37.
Rare chemical constituents absent from the training dataset can influence the prediction performance. Our approach treats such constituents as members of chemical classes if their chemical structures are identified. In this study, 158 rare constituents (observed in the test dataset but not in the training dataset) were mapped to ChemOnt classes and used for prediction. However, 12% (on average) of the total composition was still unavailable in the literature (Supplementary Table S3). The development of analytical techniques and updates to mass spectral databases will unveil the unknown composition of the EOs.
The development of chemical ontologies and automated chemical classification systems has enabled us to easily represent a plant extract as a set of hierarchical chemical classes corresponding to a mixture of diverse compounds. The plant kingdom is estimated to contain between 200,000 and 1 million metabolites 38. Recently, a growing number of papers have been devoted to the antibacterial activity of plant extracts 19. Chemical ontologies will help us to predict their biological activity, and to understand potential core structures and functional groups of the metabolites via computational approaches.
Finally, the proposed method has potential limitations. The first is that the classification results depend on the hierarchy of the chemical ontology. ChemOnt, the ontology used in this study, is based on core structures and functional groups. A theoretical model based on molecular descriptors may be a promising approach for reflecting molecular shapes and physical properties. The second limitation concerns quantitative prediction. Regression analysis of MIC values is theoretically possible, but is at present considered difficult because of the discrepancy in experimental conditions among studies. Large-scale bioassay data will support the evaluation of quantitative models.
The integration of chemical composition data with structure-based chemical ontology achieved better performance in predicting the antibacterial activity of EOs. Feature selection using the hierarchical information criterion is also effective for avoiding overfitting and constructing an interpretable model. The results indicate that machine learning-based classification of the integrated chemical composition data can be a highly efficient approach for exploring bioactive plant extracts.

Figure 1. Overview of this study. The ratios of chemical constituents that belong to each ontology class are summed to create a new compositional feature of essential oils (EOs).
https://doi.org/10.1038/s41598-024-65882-9

Figure 2. AUC and partial AUCs using the top K features of hierarchical information criterion (HIC). AUC (A), AUC 0.5 (B), AUC 0.2 (C) and AUC 0.1 (D) vs the number of top K features for HIC are shown. For comparison, those of the first K principal components (PC) obtained by principal component analysis of chemical composition data (without ChemOnt classes) are plotted.

Figure 3. Chemical features selected by HIC in ChemOnt hierarchy. (A) The chemical classes in the 32 features are shown as blue circles. The positions of the classes mentioned in the Discussion section are shown in gray. The bar plot indicates the mean chemical composition of the active (red) and inactive (black) EOs. (B) Mean composition rate of the three most abundant constituents in the classes shown in Fig. 3A and those of chemical constituents in Table 2 are shown for active (red) and inactive (black) EOs. 2I-5M-3C: 2-Isopropyl-5-methyl-3-cyclohexen-1-one.

Table 1. AUC and partial AUCs for prediction of two test datasets. Composition data of chemical constituents (Comp) with/without chemical ontology (ChemOnt) classes were trained to classify active/inactive EOs. No prior feature selection was performed for either input. Values are means ± SD of 10 iterations, and the significantly better results are highlighted in bold (paired t-test, p < 0.01). AUC: Area under the receiver operating characteristic curve.

Table 2. The top 32 features selected by hierarchical information criterion. The chemical ontology (ChemOnt) classes are shown in bold.

Table 3. Chemical composition, predicted and observed antibacterial activity of commercially available essential oils. * Values in parentheses are the percentage of the total peak area obtained from the total ion current chromatogram. † The values in parentheses are the minimal inhibitory concentrations (MICs in mg/mL) obtained by the antibacterial assay. The detailed chemical profile determined via GC/MS analysis is presented in Supplementary Table S10 and a previous report 26.